In [1]:
import pandas as pd

In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

Review of NumPy arrays

NumPy arrays are fixed-size containers for homogeneous data. For example,

  • an array of integers:

In [3]:
def print_array(a):
    print("{} elements of type {}: {}".format(len(a), a.dtype.name, a))
    
a = np.array([1,2,3])
print_array(a)


3 elements of type int64: [1 2 3]
  • or an array of real numbers:

In [4]:
a = np.array([1.5, 2.5, 3.5])
print_array(a)


3 elements of type float64: [ 1.5  2.5  3.5]
  • or an array of strings:

In [5]:
a = np.array(['cześć', 'software', 'carpentry'])
print_array(a)


3 elements of type str288: ['cześć' 'software' 'carpentry']

Accessing (indexing) elements

A single element is retrieved by its integer position in the array, counting from 0:


In [6]:
a = np.array([101, 102, 103, 104, 105])
print(a[1])


102

A sub-array of consecutive elements can be retrieved with a slice:


In [7]:
print(a[1:3])


[102 103]
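
As a minimal sketch of the slicing rules (the variable names are only illustrative): the stop index is excluded, omitted bounds default to the ends of the array, and negative indices count from the end.

```python
import numpy as np

a = np.array([101, 102, 103, 104, 105])

# The stop index is excluded: only positions 1 and 2 are returned.
middle = a[1:3]

# Omitted bounds default to the start or the end of the array.
first_two = a[:2]
from_third = a[2:]

# Negative indices count from the end of the array.
last = a[-1]

print(middle, first_two, from_third, last)
```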

Two-dimensional arrays

2D arrays in NumPy are like matrices: they have rows and columns. To retrieve an element or a sub-array, we pass two indices or slices, the first for the rows and the second for the columns.


In [8]:
a = np.arange(12).reshape(3, 4)
print(a)


[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

In [9]:
print(a[1, 2])


6

In [10]:
print(a[1:, 2:])


[[ 6  7]
 [10 11]]
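
Whole rows and columns can be selected in the same way. The following is a small sketch, with illustrative variable names, using a single index for a row and a bare colon to keep every row when selecting a column.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# A single index selects a whole row.
second_row = a[1]

# A bare colon keeps every row; the second index then selects one column.
third_column = a[:, 2]

print(second_row)    # [4 5 6 7]
print(third_column)  # [ 2  6 10]
```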
EXERCISE: Add the first and third row of the array a

Pandas data structures

Pandas defines two fundamental object types, both built upon NumPy arrays: the Series object, and the DataFrame object.

Series

A Series is a basic holder for one-dimensional labeled data. It can be created like a NumPy array:


In [11]:
s = pd.Series([0.1, 0.2, 0.3, 0.4])
s


Out[11]:
0    0.1
1    0.2
2    0.3
3    0.4
dtype: float64

Attributes of a Series: index and values

A Series has a built-in index, which by default consists of the integers 0 through N - 1:


In [12]:
s.index


Out[12]:
Int64Index([0, 1, 2, 3], dtype='int64')

You can access the underlying NumPy array representation with the .values attribute:


In [13]:
s.values


Out[13]:
array([ 0.1,  0.2,  0.3,  0.4])
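
Newer pandas releases also provide the to_numpy() method for the same purpose. The sketch below assumes pandas 0.24 or later, where this method is available; for a simple numeric Series both approaches give the same array.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.1, 0.2, 0.3, 0.4])

# Both return a NumPy array with the data only; the index labels are dropped.
arr_values = s.values
arr_numpy = s.to_numpy()  # assumes pandas >= 0.24

print(type(arr_numpy).__name__, arr_numpy.dtype)
```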

We can access series values via the index, just like for NumPy arrays:


In [14]:
s[0]


Out[14]:
0.10000000000000001

Unlike the NumPy array, though, this index can be something other than integers:


In [15]:
s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2


Out[15]:
a    0
b    1
c    2
d    3
dtype: int64

In [16]:
s2['c']


Out[16]:
2
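
When the index holds labels, it helps to say explicitly whether a lookup is by label or by position. A minimal sketch using the .loc (label) and .iloc (position) accessors:

```python
import numpy as np
import pandas as pd

s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

by_label = s2.loc['c']     # look up by index label
by_position = s2.iloc[2]   # look up by integer position (third element)

print(by_label, by_position)
```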

It's possible to construct a series directly from a Python dictionary. Let's first define the dictionary.


In [17]:
pop_dict = {'Germany': 81.3, 
            'Belgium': 11.3, 
            'France': 64.3, 
            'United Kingdom': 64.9, 
            'Netherlands': 16.9}
pop_dict['Germany']


Out[17]:
81.3

Trying to access non-existing keys in a dictionary will produce an error:


In [18]:
# pop_dict['Poland']
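
The error in question is a KeyError. The sketch below uses a shortened, illustrative copy of the dictionary and shows how to catch the error, or avoid it with dict.get:

```python
# Shortened copy of the dictionary, for illustration only.
pop_dict = {'Germany': 81.3, 'Belgium': 11.3}

try:
    pop_dict['Poland']
except KeyError as err:
    # The missing key is stored in the exception arguments.
    missing = err.args[0]
    print("missing key:", missing)

# dict.get returns None (or a default of your choice) instead of raising.
fallback = pop_dict.get('Poland')
```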

But we can add new keys easily:


In [19]:
pop_dict['Poland'] = 40
pop_dict


Out[19]:
{'Belgium': 11.3,
 'France': 64.3,
 'Germany': 81.3,
 'Netherlands': 16.9,
 'Poland': 40,
 'United Kingdom': 64.9}

NumPy-style arithmetical operations won't work:


In [20]:
#pop_dict * 1000
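
Multiplying a dictionary by a number raises a TypeError. With a plain dictionary we would have to loop over the items ourselves, for instance with a dict comprehension, as in this sketch (shortened dictionary, illustrative names):

```python
# Shortened copy of the dictionary, for illustration only.
pop_dict = {'Germany': 81.3, 'Belgium': 11.3}

try:
    pop_dict * 1000
except TypeError as err:
    print("TypeError:", err)

# Element-wise arithmetic has to be written out explicitly.
scaled = {country: value * 1000 for country, value in pop_dict.items()}
print(scaled)
```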

Now we construct a Series object from the dictionary.


In [21]:
population = pd.Series(pop_dict)
population


Out[21]:
Belgium           11.3
France            64.3
Germany           81.3
Netherlands       16.9
Poland            40.0
United Kingdom    64.9
dtype: float64

We can index the populations like a dict as expected:


In [22]:
population['France']


Out[22]:
64.299999999999997

but with the power of NumPy arrays:


In [23]:
population * 1000


Out[23]:
Belgium           11300
France            64300
Germany           81300
Netherlands       16900
Poland            40000
United Kingdom    64900
dtype: float64

Many things we have seen for NumPy can also be used with pandas objects.

Slicing:


In [24]:
population['Belgium':'Germany']


Out[24]:
Belgium    11.3
France     64.3
Germany    81.3
dtype: float64
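
Two points about label slices are worth a sketch. Unlike positional slices, they include the end label, and they depend on the order of the index. The output shown in this notebook has the countries in alphabetical order; recent pandas versions are expected to keep the insertion order of the dictionary instead (an assumption about your installation), so sorting the index first makes the slice reproducible.

```python
import pandas as pd

pop_dict = {'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
            'United Kingdom': 64.9, 'Netherlands': 16.9, 'Poland': 40}

# Sort by label so that the slice does not depend on dictionary order.
population = pd.Series(pop_dict).sort_index()

# The end label 'Germany' is included in the result.
subset = population['Belgium':'Germany']
print(subset)
```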

A range of methods:


In [25]:
population.mean()


Out[25]:
46.449999999999996
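
A few more of these methods, as a minimal sketch: max() returns the largest value, while idxmax() and idxmin() return the index labels of the largest and smallest values.

```python
import pandas as pd

population = pd.Series({'Belgium': 11.3, 'France': 64.3, 'Germany': 81.3,
                        'Netherlands': 16.9, 'Poland': 40.0,
                        'United Kingdom': 64.9})

largest = population.max()            # largest value
largest_label = population.idxmax()   # label of the largest value
smallest_label = population.idxmin()  # label of the smallest value

print(largest, largest_label, smallest_label)
```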
EXERCISE: Calculate the population of each country relative to that of France

In [ ]:

EXERCISE: Define the following Series containing prices of beverages:


Beer              5
Coffee            2.5
Orange Juice      5
Water             2
Wine              6

In [ ]:

DataFrames: Multi-dimensional Data

A DataFrame is a tabular data structure (a multi-dimensional object that holds labeled data) composed of rows and columns, akin to a spreadsheet, a database table, or R's data.frame object. You can think of it as multiple Series objects that share the same index.

One of the most common ways of creating a dataframe is from a dictionary of arrays or lists.

Note that in the IPython notebook, the data frame will display in a rich HTML view:


In [28]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries


Out[28]:
     area    capital         country  population
0   30510   Brussels         Belgium        11.3
1  671308      Paris          France        64.3
2  357050     Berlin         Germany        81.3
3   41526  Amsterdam     Netherlands        16.9
4  244820     London  United Kingdom        64.9
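
The idea of a DataFrame as several Series sharing one index can be sketched directly: passing a dictionary of Series to the constructor makes each Series a column and aligns the rows on the index labels. The names and the shortened data below are illustrative.

```python
import pandas as pd

population = pd.Series({'Belgium': 11.3, 'France': 64.3, 'Germany': 81.3})
area = pd.Series({'Belgium': 30510, 'France': 671308, 'Germany': 357050})

# Each Series becomes one column; rows are aligned on the shared labels.
df = pd.DataFrame({'population': population, 'area': area})
print(df)
```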

Attributes of the DataFrame

Besides an index attribute, a DataFrame also has a columns attribute:


In [29]:
countries.index


Out[29]:
Int64Index([0, 1, 2, 3, 4], dtype='int64')

In [30]:
countries.columns


Out[30]:
Index(['area', 'capital', 'country', 'population'], dtype='object')

To check the data types of the different columns:


In [31]:
countries.dtypes


Out[31]:
area            int64
capital        object
country        object
population    float64
dtype: object

An overview of that information can be given with the info() method:


In [32]:
countries.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 4 columns):
area          5 non-null int64
capital       5 non-null object
country       5 non-null object
population    5 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 200.0+ bytes

A DataFrame also has a values attribute, which returns its NumPy representation:


In [33]:
countries.values


Out[33]:
array([[30510, 'Brussels', 'Belgium', 11.3],
       [671308, 'Paris', 'France', 64.3],
       [357050, 'Berlin', 'Germany', 81.3],
       [41526, 'Amsterdam', 'Netherlands', 16.9],
       [244820, 'London', 'United Kingdom', 64.9]], dtype=object)

If we don't like the default index, we can replace it with one of the columns using set_index:


In [34]:
countries = countries.set_index('country')
countries


Out[34]:
                  area    capital  population
country
Belgium          30510   Brussels        11.3
France          671308      Paris        64.3
Germany         357050     Berlin        81.3
Netherlands      41526  Amsterdam        16.9
United Kingdom  244820     London        64.9
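
Note that set_index returns a new DataFrame rather than changing the original in place, and reset_index reverses the operation. A minimal sketch with a shortened, illustrative table:

```python
import pandas as pd

countries = pd.DataFrame({'country': ['Belgium', 'France'],
                          'population': [11.3, 64.3]})

indexed = countries.set_index('country')  # 'country' becomes the index
restored = indexed.reset_index()          # 'country' is a column again

print(indexed)
print(restored)
```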

To access a Series representing a column in the data, use typical indexing syntax:


In [35]:
countries['area']


Out[35]:
country
Belgium            30510
France            671308
Germany           357050
Netherlands        41526
United Kingdom    244820
Name: area, dtype: int64
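
Related selections, sketched on a shortened copy of the table (the variable names are illustrative): a list of column names returns a DataFrame, while .loc selects rows or single cells by label.

```python
import pandas as pd

countries = pd.DataFrame(
    {'area': [30510, 671308, 357050],
     'capital': ['Brussels', 'Paris', 'Berlin'],
     'population': [11.3, 64.3, 81.3]},
    index=pd.Index(['Belgium', 'France', 'Germany'], name='country'))

subset = countries[['area', 'population']]  # list of names -> DataFrame
row = countries.loc['France']               # one row -> Series
value = countries.loc['France', 'capital']  # row and column -> single cell

print(subset)
print(value)
```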

As you play around with DataFrames, you'll notice that many operations which work on NumPy arrays will also work on dataframes.

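For example, arithmetic works element-wise. Let's compute the population density of each country (the population column is given in millions, hence the factor of 1000000):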


In [36]:
countries['population']*1000000 / countries['area']


Out[36]:
country
Belgium           370.370370
France             95.783158
Germany           227.699202
Netherlands       406.973944
United Kingdom    265.092721
dtype: float64

Adding a new column to the dataframe is very simple:


In [37]:
countries['density'] = countries['population']*1000000 / countries['area']
countries


Out[37]:
                  area    capital  population     density
country
Belgium          30510   Brussels        11.3  370.370370
France          671308      Paris        64.3   95.783158
Germany         357050     Berlin        81.3  227.699202
Netherlands      41526  Amsterdam        16.9  406.973944
United Kingdom  244820     London        64.9  265.092721

We can also sort the rows by the values of a column, here from the highest to the lowest density:


In [38]:
countries.sort_values(by='density', ascending=False)


Out[38]:
                  area    capital  population     density
country
Netherlands      41526  Amsterdam        16.9  406.973944
Belgium          30510   Brussels        11.3  370.370370
United Kingdom  244820     London        64.9  265.092721
Germany         357050     Berlin        81.3  227.699202
France          671308      Paris        64.3   95.783158
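
To keep only the first rows of the sorted result, the head method can be chained after sort_values. A minimal sketch using just the density column (values copied from the table above):

```python
import pandas as pd

countries = pd.DataFrame(
    {'density': [370.370370, 95.783158, 227.699202, 406.973944, 265.092721]},
    index=['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'])

# Sort from highest to lowest density, then keep the first two rows.
top_two = countries.sort_values(by='density', ascending=False).head(2)
print(top_two)
```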

One useful method to use is the describe method, which computes summary statistics for each column:


In [39]:
countries.describe()


Out[39]:
                area  population     density
count       5.000000    5.000000    5.000000
mean   269042.800000   47.740000  273.183879
std    264012.827994   31.519645  123.440607
min     30510.000000   11.300000   95.783158
25%     41526.000000   16.900000  227.699202
50%    244820.000000   64.300000  265.092721
75%    357050.000000   64.900000  370.370370
max    671308.000000   81.300000  406.973944

The plot method can be used to quickly visualize the data in different ways:


In [40]:
countries.plot()


Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f65f9636a20>

However, for this dataset the default line plot of all columns does not say much. A bar plot of a single column is more informative:


In [41]:
countries['population'].plot(kind='bar')


Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f65f7d3ecf8>

You can play with the kind keyword: 'line', 'bar', 'hist', 'density', 'area', 'pie', 'scatter', 'hexbin'
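
Some kinds need extra arguments. As a sketch, 'scatter' works on a DataFrame and requires the names of the x and y columns. The Agg backend is selected here only so that the example also runs outside the notebook; inside the notebook, %matplotlib inline takes care of the display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for script use
import pandas as pd

countries = pd.DataFrame(
    {'area': [30510, 671308, 357050, 41526, 244820],
     'population': [11.3, 64.3, 81.3, 16.9, 64.9]},
    index=['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'])

# A scatter plot needs to know which columns go on the x and y axes.
ax = countries.plot(kind='scatter', x='area', y='population')
ax.set_xlabel('area')
ax.set_ylabel('population (millions)')
```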

EXERCISE: Define a `DataFrame` with two columns: price and volume of each beverage. Use the beverage name as the index. Add a new column `price per litre`, sort the data frame according to the values of this column, and plot the values as a bar plot.

In [ ]:


In [ ]:

Importing and exporting data

A wide range of input/output formats are natively supported by pandas:

  • CSV, text
  • SQL database
  • Excel
  • HDF5
  • JSON
  • HTML
  • pickle
  • ...

In [44]:
pd.read_csv


Out[44]:
<function pandas.io.parsers._make_parser_function.<locals>.parser_f>

In [45]:
countries.to_csv


Out[45]:
<bound method DataFrame.to_csv of                   area    capital  population     density
country                                                  
Belgium          30510   Brussels        11.3  370.370370
France          671308      Paris        64.3   95.783158
Germany         357050     Berlin        81.3  227.699202
Netherlands      41526  Amsterdam        16.9  406.973944
United Kingdom  244820     London        64.9  265.092721>
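
The two cells above only display the functions. A minimal round trip, sketched with an in-memory text buffer in place of a file on disk (the shortened table and variable names are illustrative), could look like this:

```python
import io
import pandas as pd

countries = pd.DataFrame(
    {'area': [30510, 671308],
     'capital': ['Brussels', 'Paris'],
     'population': [11.3, 64.3]},
    index=pd.Index(['Belgium', 'France'], name='country'))

# Write the table as CSV text; the index is written as the first column.
buffer = io.StringIO()
countries.to_csv(buffer)

# Rewind the buffer and read it back, using 'country' as the index again.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col='country')
print(restored)
```

With a real file, a path such as 'countries.csv' (a hypothetical name) would be passed to both functions in place of the buffer.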

Acknowledgement

© 2015, Stijn Van Hoey and Joris Van den Bossche (stijnvanhoey@gmail.com, jorisvandenbossche@gmail.com).

© 2015, modified by Bartosz Teleńczuk (original sources available from https://github.com/jorisvandenbossche/2015-EuroScipy-pandas-tutorial)

Licensed under CC BY 4.0 Creative Commons

This notebook is partly based on material of Jake Vanderplas (https://github.com/jakevdp/OsloWorkshop2014).


